Before we do anything, let's import all the libraries we need for this tutorial.
# install plotly (version 5.0) and seaborn (version 0.11.0)
# run "pip install PACKAGE-NAME" in the command line inside your project's virtual environment,
# or, if you are using Anaconda, run "conda install PACKAGE-NAME" in the conda command prompt
# run "conda install plotly==5.0" or "pip install plotly==5.0" to make sure you have the right version
import warnings
# dataframes and tables manipulation
import pandas as pd
import numpy as np
from datetime import datetime
import re # regular expressions
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# machine learning
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import statsmodels
from statsmodels.formula.api import ols
import statsmodels.api as sm
# to ignore warnings
warnings.filterwarnings('ignore')
Read and store data as pandas dataframe objects
Pandas is a library built on Python. It is very popular in the world of data science because of its extensive capabilities and wide range of uses. I will use pandas to store our datasets in dataframes, which are objects that are easy to manipulate thanks to pandas' many pre-built functions.
gdp_df = pd.read_csv("World Development Indicators.csv")
gdp_df.head()
| Country Name | Country Code | Series Name | Series Code | 2020 [YR2020] | 2019 [YR2019] | 2018 [YR2018] | 2017 [YR2017] | 2016 [YR2016] | 2015 [YR2015] | ... | 1969 [YR1969] | 1968 [YR1968] | 1967 [YR1967] | 1966 [YR1966] | 1965 [YR1965] | 1964 [YR1964] | 1963 [YR1963] | 1962 [YR1962] | 1961 [YR1961] | 1960 [YR1960] | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | GDP per capita (current US$) | NY.GDP.PCAP.CD | 508.808409485223 | 507.103391875763 | 493.756581366026 | 519.888912573073 | 509.220100485356 | 578.466352941708 | ... | 129.329760364199 | 129.108310965006 | 160.898434161304 | 137.594297961115 | 101.108325163758 | 82.0953065340517 | 78.7064287776753 | 58.4580086983139 | 59.8608999923829 | 59.7732337032148 |
| 1 | Afghanistan | AFG | GDP (current US$) | NY.GDP.MKTP.CD | 19807067268.1084 | 19291104007.6135 | 18353881129.5246 | 18869945678.4215 | 18017749073.6362 | 19907111418.9938 | ... | 1408888922.22222 | 1373333366.66667 | 1673333417.77778 | 1399999966.66667 | 1006666637.77778 | 800000044.444444 | 751111191.111111 | 546666677.777778 | 548888895.555556 | 537777811.111111 |
| 2 | Afghanistan | AFG | GDP per capita growth (annual %) | NY.GDP.PCAP.KD.ZG | -4.16819108714095 | 1.53563667424 | -1.19490038335884 | 0.0647641950115201 | -0.541416195952564 | -1.62285718631547 | ... | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. |
| 3 | Albania | ALB | GDP per capita (current US$) | NY.GDP.PCAP.CD | 5215.27675237003 | 5355.84779459031 | 5284.38018438156 | 4531.02080555986 | 4124.05572595352 | 3952.80121524465 | ... | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. |
| 4 | Albania | ALB | GDP (current US$) | NY.GDP.MKTP.CD | 14799615097.1008 | 15286612572.6895 | 15147020535.3869 | 13019693450.8816 | 11861200797.4706 | 11386846319.1589 | ... | .. | .. | .. | .. | .. | .. | .. | .. | .. | .. |
5 rows × 65 columns
# to get a better idea of what the dataframe looks like
# (rows, cols)
gdp_df.shape
(803, 65)
# skip the first 6 rows, since they contain irrelevant information and our table starts after them
disas_df = pd.read_excel('Technological and Natural Disasters.xlsx', skiprows= 6, index_col=0)
disas_df.head()
| Year | Seq | Glide | Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | Disaster Subsubtype | Event Name | Country | ... | Reconstruction Costs, Adjusted ('000 US$) | Insured Damages ('000 US$) | Insured Damages, Adjusted ('000 US$) | Total Damages ('000 US$) | Total Damages, Adjusted ('000 US$) | CPI | Adm Level | Admin1 Code | Admin2 Code | Geo Locations | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dis No | |||||||||||||||||||||
| 1900-9002-CPV | 1900 | 9002 | NaN | Natural | Climatological | Drought | Drought | NaN | NaN | Cabo Verde | ... | NaN | NaN | NaN | NaN | NaN | 3.221647 | NaN | NaN | NaN | NaN |
| 1900-9001-IND | 1900 | 9001 | NaN | Natural | Climatological | Drought | Drought | NaN | NaN | India | ... | NaN | NaN | NaN | NaN | NaN | 3.221647 | NaN | NaN | NaN | NaN |
| 1901-0003-BEL | 1901 | 3 | NaN | Technological | Technological | Industrial accident | Explosion | NaN | Coal mine | Belgium | ... | NaN | NaN | NaN | NaN | NaN | 3.221647 | NaN | NaN | NaN | NaN |
| 1902-0012-GTM | 1902 | 12 | NaN | Natural | Geophysical | Earthquake | Ground movement | NaN | NaN | Guatemala | ... | NaN | NaN | NaN | 25000.0 | 746154.0 | 3.350513 | NaN | NaN | NaN | NaN |
| 1902-0003-GTM | 1902 | 3 | NaN | Natural | Geophysical | Volcanic activity | Ash fall | NaN | Santa Maria | Guatemala | ... | NaN | NaN | NaN | NaN | NaN | 3.350513 | NaN | NaN | NaN | NaN |
5 rows × 49 columns
# to get a better idea of what the dataframe looks like
# (rows, cols)
disas_df.shape
(25307, 49)
# drops rows after the last country name, which is Zimbabwe
gdp_df = gdp_df.iloc[:651]
# drops rows with NaN entries under Series Name column
gdp_df = gdp_df.dropna(subset=['Series Name'])
gdp_df = gdp_df[gdp_df["Series Name"] != "GDP per capita growth (annual %)"]
gdp_df = gdp_df.drop(['Series Code'], axis=1)
gdp_df = gdp_df.melt(id_vars=["Country Name", "Country Code", "Series Name"],
var_name="Date",
value_name="Value")
gdp_df
| Country Name | Country Code | Series Name | Date | Value | |
|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | GDP per capita (current US$) | 2020 [YR2020] | 508.808409485223 |
| 1 | Afghanistan | AFG | GDP (current US$) | 2020 [YR2020] | 19807067268.1084 |
| 2 | Albania | ALB | GDP per capita (current US$) | 2020 [YR2020] | 5215.27675237003 |
| 3 | Albania | ALB | GDP (current US$) | 2020 [YR2020] | 14799615097.1008 |
| 4 | Algeria | DZA | GDP per capita (current US$) | 2020 [YR2020] | 3310.38653352368 |
| ... | ... | ... | ... | ... | ... |
| 26469 | Yemen, Rep. | YEM | GDP (current US$) | 1960 [YR1960] | .. |
| 26470 | Zambia | ZMB | GDP per capita (current US$) | 1960 [YR1960] | 232.188564468962 |
| 26471 | Zambia | ZMB | GDP (current US$) | 1960 [YR1960] | 713000000 |
| 26472 | Zimbabwe | ZWE | GDP per capita (current US$) | 1960 [YR1960] | 278.81384676855 |
| 26473 | Zimbabwe | ZWE | GDP (current US$) | 1960 [YR1960] | 1052990400 |
26474 rows × 5 columns
gdp_df = gdp_df.set_index(['Country Name', 'Country Code', 'Date', 'Series Name']).unstack("Series Name")
gdp_df = gdp_df.reset_index()
gdp_df = gdp_df.set_axis(['Country', 'ISO', 'Date', 'GDP', 'GDP Per Capita'], axis=1) # the inplace argument was removed in pandas 2.0
gdp_df
| Country | ISO | Date | GDP | GDP Per Capita | |
|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1960 [YR1960] | 537777811.111111 | 59.7732337032148 |
| 1 | Afghanistan | AFG | 1961 [YR1961] | 548888895.555556 | 59.8608999923829 |
| 2 | Afghanistan | AFG | 1962 [YR1962] | 546666677.777778 | 58.4580086983139 |
| 3 | Afghanistan | AFG | 1963 [YR1963] | 751111191.111111 | 78.7064287776753 |
| 4 | Afghanistan | AFG | 1964 [YR1964] | 800000044.444444 | 82.0953065340517 |
| ... | ... | ... | ... | ... | ... |
| 13232 | Zimbabwe | ZWE | 2016 [YR2016] | 20548678100 | 1464.58895715841 |
| 13233 | Zimbabwe | ZWE | 2017 [YR2017] | 19015327919.1087 | 1335.66506432532 |
| 13234 | Zimbabwe | ZWE | 2018 [YR2018] | 19523622341.6133 | 1352.16265310562 |
| 13235 | Zimbabwe | ZWE | 2019 [YR2019] | 16932434838.6928 | 1156.15486360139 |
| 13236 | Zimbabwe | ZWE | 2020 [YR2020] | 16768513442.6415 | 1128.21071129808 |
13237 rows × 5 columns

def set_Datetime(gdp_df):
    # replace the "YYYY [YRYYYY]" strings with datetime objects;
    # month and day are random since the data is yearly
    temp = gdp_df.Date.copy()
    new = pd.Series(np.nan, index=gdp_df.index, dtype="object")
    for i in range(len(temp)):
        date = temp[i]
        m = re.match(r"^(\d{4})", date)
        month = np.random.randint(1, 12 + 1) # random month (1-12)
        day = np.random.randint(1, 28 + 1)   # random day (1-28, valid in every month)
        if m:
            year = int(m.group(1))
            new[i] = datetime(year, month, day)
        else:
            print("error!")
            new[i] = np.nan
    gdp_df.Date = new
    return gdp_df
gdp_df = set_Datetime(gdp_df)
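As a side note, since only the year in these strings is meaningful, the same parsing can be done in one vectorized line. A minimal sketch, assuming the Date strings always start with a four-digit year (this pins every date to January 1st instead of a random day):

```python
import pandas as pd

# parse the leading four-digit year of strings like "1960 [YR1960]"
dates = pd.Series(["1960 [YR1960]", "2020 [YR2020]"])
parsed = pd.to_datetime(dates.str[:4], format="%Y")
print(parsed.dt.year.tolist())  # [1960, 2020]
```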
def fix_values_gdp(x):
    value = x.GDP
    if value == "..":
        return np.nan
    else:
        return int(float(value))
gdp_df.GDP = gdp_df.apply(fix_values_gdp, axis=1)
def fix_values_gdp_perCapita(x):
    value = x["GDP Per Capita"]
    if value == "..":
        return np.nan
    else:
        return int(float(value))
gdp_df["GDP Per Capita"] = gdp_df.apply(fix_values_gdp_perCapita, axis=1)
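Equivalently, pd.to_numeric can coerce the ".." placeholders to NaN in a single vectorized pass. A sketch with toy values:

```python
import pandas as pd

# ".." marks missing values in the World Bank export;
# errors="coerce" turns anything non-numeric into NaN
gdp = pd.Series(["537777811.111111", "..", "1052990400"])
cleaned = pd.to_numeric(gdp, errors="coerce")
print(cleaned.isna().tolist())  # [False, True, False]
```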
gdp_df = gdp_df[gdp_df["Date"] >= datetime(1970, 1, 1)]
gdp_df = gdp_df.reset_index(drop = True)
gdp_df.head()
| Country | ISO | Date | GDP | GDP Per Capita | |
|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1970-11-23 00:00:00 | 1.748887e+09 | 156.0 |
| 1 | Afghanistan | AFG | 1971-09-22 00:00:00 | 1.831109e+09 | 159.0 |
| 2 | Afghanistan | AFG | 1972-01-15 00:00:00 | 1.595555e+09 | 135.0 |
| 3 | Afghanistan | AFG | 1973-08-20 00:00:00 | 1.733333e+09 | 143.0 |
| 4 | Afghanistan | AFG | 1974-09-26 00:00:00 | 2.155555e+09 | 173.0 |

disas_df = disas_df.reset_index() # since Dis No is set to be the index, we need to reset the index
unwantedCols = ['Dis No', "Year", "Seq", "Glide", "Disaster Subsubtype", "Event Name", "Location", "Origin",
                "Associated Dis", "Associated Dis2", "OFDA Response", "Appeal", "Declaration", "Aid Contribution",
                "Dis Mag Value", "Dis Mag Scale", "Local Time", "River Basin", "End Year", "End Month",
                "End Day", "No Injured", "No Affected", "No Homeless", "Total Affected",
                "Reconstruction Costs ('000 US$)", "Reconstruction Costs, Adjusted ('000 US$)",
                "Insured Damages ('000 US$)", "Insured Damages, Adjusted ('000 US$)",
                "Total Damages ('000 US$)", "Total Damages, Adjusted ('000 US$)",
                "Adm Level", "Admin1 Code", "Admin2 Code", "Geo Locations"]
disas_df = disas_df.drop(unwantedCols, axis=1)
disas_df = disas_df[disas_df["Start Year"] >= 1970]
disas_df = disas_df[disas_df["Start Year"] <= 2020]
disas_df = disas_df[disas_df["Disaster Subgroup"] != "Extra-terrestrial"]
disas_df = disas_df[disas_df["Disaster Type"] != "Animal accident"]
disas_df = disas_df.dropna(subset = ["Total Deaths", "ISO"])
disas_df = disas_df.reset_index(drop=True)
disas_df.head()
| Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | Country | ISO | Region | Continent | Latitude | Longitude | Start Year | Start Month | Start Day | Total Deaths | CPI | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Technological | Technological | Miscellaneous accident | Fire | United Arab Emirates (the) | ARE | Western Asia | Asia | NaN | NaN | 1970 | 5.0 | NaN | 41.0 | 15.001282 |
| 1 | Technological | Technological | Miscellaneous accident | Collapse | Argentina | ARG | South America | Americas | NaN | NaN | 1970 | 1.0 | 4.0 | 25.0 | 15.001282 |
| 2 | Natural | Hydrological | Flood | NaN | Argentina | ARG | South America | Americas | NaN | NaN | 1970 | 1.0 | 4.0 | 36.0 | 15.001282 |
| 3 | Natural | Meteorological | Storm | Tropical cyclone | Australia | AUS | Australia and New Zealand | Oceania | NaN | NaN | 1970 | 1.0 | NaN | 13.0 | 15.001282 |
| 4 | Technological | Technological | Miscellaneous accident | Other | Belgium | BEL | Western Europe | Europe | NaN | NaN | 1970 | 5.0 | 23.0 | 1.0 | 15.001282 |
def set_Datetime(disas_df):
    # add a column called "Date" filled with NaN objects
    disas_df["Date"] = np.nan
    for i in range(disas_df.shape[0]):
        year = disas_df["Start Year"][i]
        if np.isnan(year): # return NaN so we can get rid of that row later
            return np.nan
        else:
            year = int(year)
        month = disas_df["Start Month"][i]
        day = disas_df["Start Day"][i]
        if np.isnan(month):
            month = np.random.randint(1, 12 + 1) # random month (1-12)
        else:
            month = int(month)
        if np.isnan(day):
            day = np.random.randint(1, 28 + 1) # random day (1-28, valid in every month)
        else:
            day = int(day)
        try: # this fixes some issues with the dataset, e.g. day set to 31 while the month has 30 days
            disas_df["Date"][i] = datetime(year, month, day)
        except ValueError:
            day = np.random.randint(1, 28 + 1)
            month += 1
            if month > 12:
                month = 1
            disas_df["Date"][i] = datetime(year, month, day)
    return disas_df
# populate the Date column
disas_df = set_Datetime(disas_df)
unwantedCols = ["Start Month", "Start Day"]
disas_df = disas_df.drop(unwantedCols, axis=1)
disas_df = disas_df.rename(columns = {"Start Year": "Year"})
disas_df.head()
def adjust_Latitude(x):
    value = x.Latitude
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan
disas_df.Latitude = disas_df.apply(adjust_Latitude, axis=1)
def adjust_Longitude(x):
    value = x.Longitude
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan
disas_df.Longitude = disas_df.apply(adjust_Longitude, axis=1)
def adjust_Subtype(x):
    value = x["Disaster Subtype"]
    if type(value) == float:
        return np.nan
    m = re.match("NaN|nan", value)
    if m:
        return np.nan
    else:
        return value
disas_df["Disaster Subtype"] = disas_df.apply(adjust_Subtype, axis=1)
gdp_df.ISO = gdp_df.ISO.astype(str)
disas_df.ISO = disas_df.ISO.astype(str)

# copy them for later use
disas_df_copy = disas_df.copy()
gdp_df_copy = gdp_df.copy()
deaths_time_df = disas_df[["Total Deaths", "Date"]]
sns.set(rc={'figure.figsize':(15,20)})
sns.set(font_scale = 1.5)
myplot = sns.scatterplot(data=deaths_time_df, x="Date", y="Total Deaths").set(title='Deaths by Disasters over Time')

In this situation, we could trim 1%, 3%, or even 10% of our data based on the Total Deaths values. Alternatively, we can change the scale of our plot so that all the values are clearly visible; that is where the log function comes in!
In this tutorial, we will demonstrate both approaches for you :)
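To see why a log scale helps, here is a quick sketch: log10 maps values spanning several orders of magnitude onto a small, comparable range, so extreme outliers no longer squash the rest of the plot.

```python
import numpy as np

# heavily skewed death counts spanning four orders of magnitude
deaths = np.array([10, 1_000, 100_000])
print(np.log10(deaths))  # [1. 3. 5.]
```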
deaths_time_df = disas_df[["Total Deaths", "Date"]]
sns.set(rc={'figure.figsize':(15,20)})
sns.set(font_scale = 1.5)
sns.scatterplot( data=deaths_time_df, x="Date",
y=np.log10(deaths_time_df["Total Deaths"])).set(title='Deaths by Disasters over Time')
plt.ylabel("Log_10 of Deaths")
Text(0, 0.5, 'Log_10 of Deaths')

g = sns.FacetGrid(disas_df, col="Disaster Group", height = 5, aspect = 1, sharey = True)
g.map(sns.histplot, "Date")
<seaborn.axisgrid.FacetGrid at 0x7fd0a7c95710>
As expected, it seems that technological disasters happen less frequently than natural disasters.
Let's use a pie chart to further explore how the disasters are distributed.
subgrouped_df = disas_df.groupby("Disaster Subgroup").count()
data = subgrouped_df.ISO
labels = subgrouped_df.index.array
colors = sns.color_palette('pastel')[0:len(labels)]
sns.set(rc={'figure.figsize':(16,16)})
sns.set(font_scale = 1.4)
plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')
plt.title("Percentage of Disaster Occurences")
plt.show()
groupedby_summed = disas_df.groupby(by=["Disaster Subgroup"], as_index=False).sum()
data = groupedby_summed["Total Deaths"]
labels = groupedby_summed["Disaster Subgroup"]
colors = sns.color_palette('pastel')[0:len(labels)]
sns.set(rc={'figure.figsize':(16,16)})
sns.set(font_scale = 1.4)
plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')
plt.title("Percentage of Deaths by Disasters")
plt.show()
Interestingly, although technological disasters account for 44% of occurrences, they are responsible for only 18% of deaths. Climatological disasters, on the other hand, appear to be especially deadly.
sns.set_theme(style="whitegrid")
sns.set(font_scale = 1.4)
sns.boxplot(x="Disaster Subgroup", y =np.log10(disas_df["Total Deaths"]), data = disas_df)
plt.ylabel("Log_10 of Deaths")
Text(0, 0.5, 'Log_10 of Deaths')
It appears that our data has some outliers that make it heavily skewed. To deal with them, we can trim the lowest and highest 1% of each disaster subgroup (by Total Deaths).
res_df = disas_df[:0]
subGroups = disas_df["Disaster Subgroup"].unique()
for i in range(len(subGroups)):
    subgroup = subGroups[i]
    curr_df = disas_df[disas_df["Disaster Subgroup"] == subgroup].sort_values(by=["Total Deaths"])
    # trim the lowest and highest 1% of each subgroup
    start = int(curr_df.shape[0] * 1 / 100)
    if start == 0:
        start = 1
    end = int(curr_df.shape[0] * 99 / 100)
    res_df = pd.concat([res_df, curr_df[start:end]]) # DataFrame.append was removed in pandas 2.0
disas_df = res_df
disas_df = disas_df.reset_index(drop=True)
sns.boxplot(x="Disaster Subgroup", y =np.log10(disas_df["Total Deaths"]), data = disas_df)
plt.ylabel("Log_10 of Deaths")
Text(0, 0.5, 'Log_10 of Deaths')
We can see that the most extreme of the outliers were removed
#The only columns that make sense are Disaster Subgroup and Total Deaths, which is all we need here
groupedby_summed = disas_df.groupby(by=["Disaster Subgroup"], as_index=False).sum()
groupedby_summed
| Disaster Subgroup | Latitude | Longitude | Year | Total Deaths | CPI | |
|---|---|---|---|---|---|---|
| 0 | Biological | 0.000000 | 0.000000 | 2379092 | 192464.0 | 81281.374483 |
| 1 | Climatological | 0.000000 | 0.000000 | 458575 | 153612.0 | 16277.081566 |
| 2 | Geophysical | 11479.411879 | 30281.462367 | 1693892 | 385919.0 | 52797.426618 |
| 3 | Hydrological | 15540.659095 | 34940.138000 | 8599182 | 211945.0 | 309357.646746 |
| 4 | Meteorological | 2649.932830 | 4003.037000 | 6501586 | 178210.0 | 223011.935103 |
| 5 | Technological | 0.000000 | 0.000000 | 15699050 | 273550.0 | 550324.551729 |
data = groupedby_summed["Total Deaths"]
labels = groupedby_summed["Disaster Subgroup"]
colors = sns.color_palette('pastel')[0:len(labels)]
sns.set(rc={'figure.figsize':(16,16)})
sns.set(font_scale = 1.4)
plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')
plt.title("Percentage of Deaths by Disasters")
plt.show()
Interestingly, the share of deaths attributed to technological disasters increased noticeably after trimming.
sns.set(rc={'figure.figsize':(15,15)})
sns.set(font_scale = 1.4)
sns.barplot(data = groupedby_summed, x = "Disaster Subgroup", y = "Total Deaths")
plt.title("Deaths by Disaster Subgroup")
plt.ylabel("Deaths")
plt.xlabel("Disaster Subgroup")
plt.show()
disas_df["Cumulative"] = np.nan
temp = disas_df["Cumulative"].copy()
disas_df = disas_df.sort_values(by=["Date"])
disas_df = disas_df.reset_index(drop=True)
summ = 0
for i in range(disas_df.shape[0]):
summ += disas_df["Total Deaths"][i]
temp[i] = summ
disas_df["Cumulative"] = temp
sns.set(font_scale = 1.4)
sns.set(rc={'figure.figsize':(15,15)})
sns.lineplot(data=disas_df, x="Date", y="Cumulative").set(title='Deaths by Disasters over Time')
[Text(0.5, 1.0, 'Deaths by Disasters over Time')]
g = sns.FacetGrid(disas_df, col="Disaster Subgroup", height = 10, aspect = 1, sharey = False)
g.map(sns.scatterplot, "Date", "Total Deaths")
<seaborn.axisgrid.FacetGrid at 0x7fd0a62cebd0>
To plot cumulative deaths by subgroup, we need to accumulate each subgroup individually, which is what the following does.
subGroups = disas_df["Disaster Subgroup"].unique()
disas_df["Cumulative Subgroup"] = np.nan
res_disas_df = disas_df[:0].copy()
for sub in subGroups:
curr = disas_df[disas_df["Disaster Subgroup"] == sub]
curr = curr.sort_values(by=["Date"])
curr = curr.reset_index(drop = True)
temp = disas_df["Cumulative Subgroup"].copy()
summ = 0
for i in range(curr.shape[0]):
summ += curr["Total Deaths"][i]
temp[i] = summ
curr["Cumulative Subgroup"] = temp
res_disas_df = res_disas_df.append(curr)
disas_df = res_disas_df
disas_df = disas_df.sort_values(by=["Date"])
disas_df = disas_df.reset_index(drop = True)
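The per-subgroup running totals can also be obtained in one vectorized line: after sorting by Date, groupby().cumsum() accumulates within each group independently. A toy sketch:

```python
import pandas as pd

# toy frame already sorted by date
df = pd.DataFrame({
    "Disaster Subgroup": ["Technological", "Natural", "Technological"],
    "Total Deaths": [10.0, 5.0, 7.0],
})
# cumulative sum restarts nowhere; it simply accumulates within each subgroup
df["Cumulative Subgroup"] = df.groupby("Disaster Subgroup")["Total Deaths"].cumsum()
print(df["Cumulative Subgroup"].tolist())  # [10.0, 5.0, 17.0]
```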
g = sns.FacetGrid(disas_df, col="Disaster Subgroup", height = 8, aspect = 1, sharey = False)
g.map(sns.lineplot, "Date", "Cumulative Subgroup")
<seaborn.axisgrid.FacetGrid at 0x7fd0a6246510>
subGroups = disas_df["Disaster Subgroup"].unique()
colors = sns.color_palette('bright')[0:len(subGroups)]
sns.set(rc={'figure.figsize':(15,15)})
sns.lineplot(data=disas_df, x="Date", y="Cumulative Subgroup", hue="Disaster Subgroup", palette = colors).set(title='Deaths by Disasters over Time')
[Text(0.5, 1.0, 'Deaths by Disasters over Time')]
We can see from the graphs above that technological disasters have skyrocketed in recent years, and they are largely responsible for the jump in deaths from the 1980s until around 2010; over that period, deaths from technological disasters grew roughly exponentially.
groupedby_summed = disas_df.groupby(by=["Disaster Type"], as_index=False).sum()
data = groupedby_summed["Total Deaths"]
labels = groupedby_summed["Disaster Type"]
colors = sns.color_palette('pastel')[0:len(labels)]
plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')
plt.title("Percentage of Deaths by Disasters")
plt.show()
groupedby_summed = disas_df.groupby(by=["Disaster Type"], as_index=False).count()
data = groupedby_summed["Total Deaths"]
labels = groupedby_summed["Disaster Type"]
colors = sns.color_palette('pastel')[0:len(labels)]
plt.pie(data, labels = labels, colors = colors, autopct='%.0f%%')
plt.title("Percentage of Disaster Occurrences")
plt.show()
combine = ["Drought", "Wildfire", "Volcanic activity", "Mass movement (dry)"]
temp = disas_df["Disaster Type"].copy()
for i in range(disas_df.shape[0]):
if(temp[i] in combine):
temp[i] = "Other Natural"
disas_df["Disaster Type"] = temp
Just like we did before, we will use the pandas groupby() function. It lets us easily sum, count, or perform many other operations on our dataframe. Learn more about Pandas.groupby(). To get the names of the grouped column, we can run df.index.array.
sns.set(font_scale = 1.4)
df = disas_df.groupby(by =["Continent"]).count()
sns.barplot(data = df, x = df.index.array, y = "ISO")
plt.title("Occurences by Continent")
plt.ylabel("Times")
plt.xlabel("Continent")
plt.show()
df = disas_df.groupby(by =["Continent"]).sum()
sns.barplot(data = df, x = df.index.array, y = "Total Deaths")
plt.title("Deaths by Continent")
plt.ylabel("Deaths")
plt.xlabel("Continent")
plt.show()
To do this, we will use plotly, a scientific graphing library. It lets us create impressive graphs very easily!
Additionally, one of my favorite features of plotly is its continuous color bar.
Learn more about plotly.
coor_disas_df = disas_df.dropna(subset=["Longitude", "Latitude"]).copy()
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude', color=np.log2(coor_disas_df["Total Deaths"]),
hover_data=["Country", "Total Deaths", "Disaster Subtype"]) # hovering the mouse will display these
fig.update_layout(title = 'World map', title_x=0.5)
fig.update_layout(coloraxis_colorbar=dict(
title="Deaths",
tickvals=[1, 3, 5, 7, 9, 10, 12],
ticktext=["2", "10", "30", "125", "500", "1k", "4k"],
))
fig.show()
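Since the color encodes log2 of the deaths, the colorbar ticks above sit at log2 positions and are labeled with (rounded) original counts. A quick sketch of that mapping:

```python
tickvals = [1, 3, 5, 7, 9, 10, 12]
# each tick at position v on the log2 color axis corresponds to 2**v deaths
print([int(2 ** v) for v in tickvals])  # [2, 8, 32, 128, 512, 1024, 4096]
```

The labels in the figure ("2", "10", "30", "125", "500", "1k", "4k") are these values rounded for readability.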
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude',
color=np.log2(coor_disas_df["Total Deaths"]), facet_col="Disaster Subgroup",
hover_data=["Country", "Total Deaths", "Disaster Subtype"])
fig.update_layout(title = 'World map', title_x=0.5)
fig.update_layout(coloraxis_colorbar=dict(
title="Deaths",
tickvals=[1, 3, 5, 7, 9, 12],
ticktext=["2", "10", "30", "125", "512", "4096"],
))
fig.show()
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude', color=coor_disas_df["Disaster Subgroup"],
hover_data=["Country", "Total Deaths", "Disaster Subtype"])
fig.update_layout(title = 'World map', title_x=0.5)
fig.show()
We can also plot multiple graphs next to each other with the facet_col argument...
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude', color=coor_disas_df["Disaster Subgroup"],
facet_col="Disaster Subgroup", hover_data=["Country", "Total Deaths", "Disaster Subtype"])
fig.update_layout(title = 'World map', title_x=0.5)
fig.show()
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude', color=coor_disas_df["Year"],
hover_data=["Country", "Total Deaths", "Disaster Subtype"])
fig.update_layout(title = 'World map', title_x=0.5)
fig.show()
fig = px.scatter_geo(coor_disas_df,lat='Latitude',lon='Longitude', color=coor_disas_df["Year"],
facet_col="Disaster Subgroup", hover_data=["Country", "Total Deaths", "Disaster Subtype"])
fig.update_layout(title = 'World map', title_x=0.5)
fig.show()
Looking at the graphs, it seems that most of the records with coordinates are recent. This is no coincidence, since recording coordinates requires technological means that are far more common nowadays.
countries = disas_df["ISO"].unique()
toRemove = []
for index, row in gdp_df.iterrows():
if(row.ISO in countries):
pass
else:
toRemove.append(index)
gdp_df = gdp_df.drop(toRemove)
gdp_df = gdp_df.reset_index(drop = True)
fig = px.line(gdp_df, x="Date", y="GDP", color='ISO', width=1000, height=1500)
fig.update_layout(title = 'World GDP', title_x=0.5)
fig.show()
fig = px.line(gdp_df, x="Date", y="GDP Per Capita", color='ISO', width=1000, height=1500)
fig.update_layout(title = 'World GDP Per Capita', title_x=0.5)
fig.show()
df = gdp_df[gdp_df.Date >= datetime(2020, 1, 1)]
fig = px.choropleth(df, locations="ISO",
                    color="GDP", # column that determines the color
                    hover_name="ISO" # column to add to hover information
                    )
fig.update_layout(title = 'World GDP', title_x=0.5)
fig.show()
fig = px.choropleth(df, locations="ISO",
                    color="GDP Per Capita", # column that determines the color
                    hover_name="ISO" # column to add to hover information
                    )
fig.update_layout(title = 'World GDP Per Capita', title_x=0.5)
fig.show()
df = disas_df.groupby("ISO").sum()
df["ISO"] = df.index.array
fig = px.choropleth(df, locations="ISO",
                    color="Total Deaths", # column that determines the color
                    hover_name="ISO" # column to add to hover information
                    )
fig.update_layout(title = 'World Deaths', title_x=0.5)
fig.show()
Now that we have explored our data, we can start the next phase of the data science pipeline.
In this stage, we will go through some topics related to machine learning and model creation. This section will teach you ways to predict information based on predefined variables.
We will cover two main topics: Ordinary Least Squares and regression analysis.
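Before applying this to our data, here is a minimal, self-contained sketch of ordinary least squares with the statsmodels formula API on synthetic data (the variable names and true coefficients are purely illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)  # true relationship: y = 2x + 1, plus noise
model = ols("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit()
print(model.params)  # Intercept close to 1.0, x close to 2.0
```

The fitted coefficients recover the true slope and intercept because the noise is small; this is exactly the pattern we will follow with Total Deaths and GDP per capita below.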
Before we start, it is better to have column names without whitespace, because some of the functions we will use next assume that column names contain no whitespace.
Also, let's add a GDP_PC (GDP per capita) column to our disas_df dataframe.
# add a Year column
gdp_df["Year"] = np.nan
for index, row in gdp_df.iterrows():
    gdp_df["Year"][index] = row.Date.year
def get_gdp_per_capita(iso, year):
    for index, row in gdp_df.iterrows():
        gpd_per_capita = gdp_df["GDP Per Capita"][index]
        curr_year = row.Year # note: the column is "Year", not "year"
        curr_iso = row.ISO
        if iso == curr_iso and year == curr_year:
            return gpd_per_capita
    return np.nan
# add a Year column
disas_df["Year"] = np.nan
for index, row in disas_df.iterrows():
    disas_df["Year"][index] = row.Date.year
# add a gdp per capita column
disas_df = pd.merge(disas_df, gdp_df[["GDP Per Capita", "ISO", "Year"]], on=["ISO", "Year"], how='left')
disas_df = disas_df.rename(columns={"GDP Per Capita": "GDP_PC"})
# for models stage
disas_df_mdl = disas_df.rename(columns = {"Total Deaths": "Total_Deaths", "Disaster Subtype": "Disaster_Subtype",
                                          "Disaster Type": "Disaster_Type", "Disaster Subgroup": "Disaster_Subgroup",
                                          "Disaster Group": "Disaster_Group"})
To find out what this coefficient is, we can use the statsmodels library.
Regarding NaN entries: in this tutorial we simply drop them when building our prediction models.
# Since we have some NaN entries, drop rows with NaN entries under the needed columns
disas_df_curr = disas_df_mdl.dropna(subset=['Total_Deaths', "GDP_PC"])
disas_df_curr = disas_df_curr.sort_values(by=["Date"])
disas_df_curr = disas_df_curr.reset_index(drop = True)
# let's divide our data
X = np.array(disas_df_curr[["GDP_PC"]])
Y = np.array(disas_df_curr[["Total_Deaths"]])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
disas_df_curr.head()
| Disaster_Group | Disaster_Subgroup | Disaster_Type | Disaster_Subtype | Country | ISO | Region | Continent | Latitude | Longitude | Year | Total_Deaths | CPI | Date | Cumulative | Cumulative Subgroup | GDP_PC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Natural | Biological | Epidemic | Bacterial disease | Niger (the) | NER | Western Africa | Africa | NaN | NaN | 1970.0 | 319.0 | 15.001282 | 1970-01-01 00:00:00 | 819.0 | 819.0 | 144.0 |
| 1 | Natural | Geophysical | Earthquake | Ground movement | China | CHN | Eastern Asia | Asia | 24.185 | 102.543 | 1970.0 | 10000.0 | 15.001282 | 1970-01-04 00:00:00 | 10857.0 | 10000.0 | 113.0 |
| 2 | Natural | Meteorological | Storm | Tropical cyclone | Australia | AUS | Australia and New Zealand | Oceania | NaN | NaN | 1970.0 | 13.0 | 15.001282 | 1970-01-04 00:00:00 | 857.0 | 13.0 | 3299.0 |
| 3 | Technological | Technological | Miscellaneous accident | Collapse | Argentina | ARG | South America | Americas | NaN | NaN | 1970.0 | 25.0 | 15.001282 | 1970-01-04 00:00:00 | 844.0 | 25.0 | 1322.0 |
| 4 | Natural | Hydrological | Flood | NaN | Argentina | ARG | South America | Americas | NaN | NaN | 1970.0 | 36.0 | 15.001282 | 1970-01-04 00:00:00 | 10893.0 | 36.0 | 1322.0 |
# fitting the linear regression model with training data
df = pd.DataFrame()
df["Total_Deaths"] = pd.DataFrame(Y_train)
df["GDP_PC"] = pd.DataFrame(X_train)
model = ols("Total_Deaths ~ GDP_PC", data=df).fit()
Y_prediction = np.empty(Y_test.shape) # we will fill this in below
for i in range(Y_prediction.shape[0]):
    observation = X_test[i]
    Y_prediction[i] = model.predict(exog=dict(GDP_PC=observation[0]))
plt.scatter(X_test,np.log10(np.reshape(Y_test, len(Y_prediction))), c='b', marker='x', label='Actual')
plt.scatter(X_test, np.log10(np.reshape(Y_prediction, len(Y_prediction))), c='r', marker='s', label='Prediction')
plt.legend(loc='upper left')
plt.xlabel("GDP Per Capita")
plt.ylabel("log_10 Deaths")
plt.show()
To see whether there is a correlation between deaths and GDP, we can also look at the statistics of our model.
print(model.pvalues)
print(" ")
for val in model.pvalues:
    print("is less than 0.05? : ", (val < 0.05))
Intercept    2.516541e-37
GDP_PC       9.596381e-03
dtype: float64

is less than 0.05? :  True
is less than 0.05? :  True
model.fvalue # the greater the better
6.710602372629921
To see a summary of the important statistics of our model, we can simply use the summary() function!
# rest of statistics
model.summary()
| Dep. Variable: | Total_Deaths | R-squared: | 0.001 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.001 |
| Method: | Least Squares | F-statistic: | 6.711 |
| Date: | Sat, 18 Dec 2021 | Prob (F-statistic): | 0.00960 |
| Time: | 17:27:14 | Log-Likelihood: | -88446. |
| No. Observations: | 11359 | AIC: | 1.769e+05 |
| Df Residuals: | 11357 | BIC: | 1.769e+05 |
| Df Model: | 1 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 79.9426 | 6.239 | 12.813 | 0.000 | 67.713 | 92.173 |
| GDP_PC | -0.0012 | 0.000 | -2.590 | 0.010 | -0.002 | -0.000 |
| Omnibus: | 29851.458 | Durbin-Watson: | 2.005 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 571999441.124 |
| Skew: | 31.099 | Prob(JB): | 0.00 |
| Kurtosis: | 1100.582 | Cond. No. | 1.54e+04 |
We can see that all p-values are less than 0.05, so we can reject the null hypothesis. The coefficient of GDP per capita is around -0.0012. This makes sense: the poorer a country is, the less strict its safety regulations around disasters are likely to be.
Our model works, but it still does not give us good predictions (as seen in the graph). Also, since the F-value is small, our model is not yet trustworthy for predicting the number of deaths.
When we add more parameters to our model, we expect it to work better and more precisely. So, let's add Year as a second parameter and see what happens...
This time, we will try to find the coefficients that will make our model fit the data best.
# Since we have some NaN entries, drop rows with NaN entries under the needed columns
disas_df_curr = disas_df_mdl.dropna(subset=['Total_Deaths', "GDP_PC", "Year"])
disas_df_curr = disas_df_curr.sort_values(by=["Date"])
disas_df_curr = disas_df_curr.reset_index(drop = True)
# let's divide our data
X = np.array(disas_df_curr[["GDP_PC", "Year"]])
Y = np.array(disas_df_curr[["Total_Deaths"]])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
# fitting the linear regression model with training data
df = pd.DataFrame()
df["Total_Deaths"] = pd.DataFrame(Y_train)
df[["GDP_PC", "Year"]] = pd.DataFrame(X_train)
model = ols("Total_Deaths ~ GDP_PC * Year", data=df).fit()
model.summary()
| Dep. Variable: | Total_Deaths | R-squared: | 0.009 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.008 |
| Method: | Least Squares | F-statistic: | 32.66 |
| Date: | Sat, 18 Dec 2021 | Prob (F-statistic): | 5.22e-21 |
| Time: | 17:27:15 | Log-Likelihood: | -88401. |
| No. Observations: | 11359 | AIC: | 1.768e+05 |
| Df Residuals: | 11355 | BIC: | 1.768e+05 |
| Df Model: | 3 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 1.072e+04 | 1115.339 | 9.609 | 0.000 | 8531.000 | 1.29e+04 |
| GDP_PC | -0.5182 | 0.107 | -4.827 | 0.000 | -0.729 | -0.308 |
| Year | -5.3137 | 0.557 | -9.537 | 0.000 | -6.406 | -4.222 |
| GDP_PC:Year | 0.0003 | 5.34e-05 | 4.823 | 0.000 | 0.000 | 0.000 |
| Omnibus: | 29816.092 | Durbin-Watson: | 2.012 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 568351687.771 |
| Skew: | 31.003 | Prob(JB): | 0.00 |
| Kurtosis: | 1097.077 | Cond. No. | 5.56e+09 |
print(model.pvalues)
print(" ")
for val in model.pvalues:
    print("is less than 0.05? : ", (val < 0.05))
Intercept      8.867961e-22
GDP_PC         1.403513e-06
Year           1.769230e-21
GDP_PC:Year    1.431093e-06
dtype: float64

is less than 0.05? :  True
is less than 0.05? :  True
is less than 0.05? :  True
is less than 0.05? :  True
model.fvalue
32.65651689957031
Better! We managed to increase the F-value of our model by including Year as a second parameter. The greater the F-value, the better the model. You can read more about F-values here.
Also, feel free to try out different parameters to see whether there is a correlation between them and the number of deaths.
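For intuition about where the F-value comes from: for a model with an intercept, the F-statistic can be recovered from R-squared as F = (R² / df_model) / ((1 - R²) / df_resid). A sketch on synthetic data (variable names illustrative) verifying this identity against statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 0.5 * df["x"] + rng.normal(size=100)
model = ols("y ~ x", data=df).fit()

# F = (R^2 / df_model) / ((1 - R^2) / df_resid)
f_from_r2 = (model.rsquared / model.df_model) / ((1 - model.rsquared) / model.df_resid)
print(np.isclose(f_from_r2, model.fvalue))  # True
```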
In this stage, which is usually the last one, we try to reach conclusions about our findings.
By plotting our disasters and GDP dataframes, we uncovered a number of interesting findings.
In this tutorial, we definitely did not cover everything there is to learn about our datasets; there is simply a lot to uncover! As you saw, there are many ways to visualize data, and it can sometimes be very hard, especially with multidimensional data. What helped us understand our data was tidying our datasets so that information was easier to extract. Knowing the best ways to visualize data is also very important, as we saw in the visualization stage when we put the y-axis of a scatter plot on a log scale.
There are so many things to uncover and countless possibilities; we can discover unexpected relationships between different variables in our data, as we saw with the number of deaths and GDP per capita.
